[Hummer Engine Optimization Series] Tracking Down the Ultimate GC Bug

U4内核技术 (U4 Kernel Technology) 2022-07-13

Abstract

This article records how we analyzed and tracked down several bizarre, non-reproducible top crashes that appeared after upgrading to Flutter 2.2.3.

What the crash looks like

The crash shows up in several variants, and together they account for several of our top 5 crashes.

What they have in common is that they all crash inside Expando.[]=. From the code we can tell that this is actually Expando._rehash inlined into Expando.[]=.

  for (var i = 0; i < old_data.length; i++) {
    var entry = old_data[i];
    if (entry != null) {
      // Ensure that the entry.key is not cleared between checking for it and
      // inserting it into the new table.
      var val = entry.value;
      var key = entry.key;
      if (key != null) {
=>      this[key] = val;
      }
    }
  }
}


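At the arrow, the compiler inserts an AssertAssignable check that val can be assigned to type T. This is where execution enters functions such as TypeCheck or RuntimeTypeIsSubtypeOf.
Inspecting val shows that it has turned into a FreeListElement: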

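The val object lives at 91c5edd0, and its class id, stored in the upper 16 bits, is 1. According to the header files, class id 1 is FreeListElement. So this memory has already been swept by the sweeper, which means the marker never marked it. This is a use-after-free error.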
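The check described above can be sketched in a few lines of C++. This is only an illustration under the assumptions stated in the text: a 32-bit header word whose upper 16 bits hold the class id, with class id 1 meaning FreeListElement. The names `ClassIdFromHeader`, `LooksFreed` and `CheckHeaderLooksLive` are hypothetical helpers, not Dart VM functions.

```cpp
#include <cassert>
#include <cstdint>

// Assumption from the text: the class id sits in the upper 16 bits of a
// 32-bit header word.
constexpr uint32_t kClassIdShift = 16;
// Per the article, class id 1 is FreeListElement.
constexpr uint32_t kFreeListElementCid = 1;

// Extracts the class id from a header word.
constexpr uint32_t ClassIdFromHeader(uint32_t header) {
  return header >> kClassIdShift;
}

// True if the header says this memory is a free-list element, i.e. the
// object that used to live here has already been swept.
constexpr bool LooksFreed(uint32_t header) {
  return ClassIdFromHeader(header) == kFreeListElementCid;
}

// Hypothetical debug-only guard one could place before dereferencing.
inline void CheckHeaderLooksLive(uint32_t header) {
  assert(!LooksFreed(header));
}
```

A header word read from a crash dump can be passed to `LooksFreed` to tell at a glance whether the pointer was dangling; the low 16 bits are ignored by this sketch.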
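Looking across these crashes, they all occur in Expando, and what they have in common is that the freed object is referenced by a WeakProperty. Here is the code of Expando.[]=: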

if (_used < _limit) {
  var ephemeron = new _WeakProperty();
  ephemeron.key = object;
  ephemeron.value = value;
  _data[idx] = ephemeron;
  _used++;
  return;
}

Analysis and debugging process

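From the source above, the reference chain is Expando instance => _data array => WeakProperty instance => value.
So what caused value to be mismarked? I read through ProcessWeakProperty in the marker and found nothing obviously wrong. The crash could still be reproduced by running monkey tests, though, so the next step was to add logging. The idea was to print the relevant state on every _rehash and every []=.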

DEFINE_NATIVE_ENTRY(WeakProperty_validate, 0, 2) {
  GET_NON_NULL_NATIVE_ARGUMENT(WeakProperty, weak_property,
                               arguments->NativeArgAt(0));
  GET_NON_NULL_NATIVE_ARGUMENT(String, tag, arguments->NativeArgAt(1));
  ObjectPtr value = weak_property.value();
  ObjectPtr key = weak_property.key();
  OS::PrintErr(
      "WeakProperty_validate %s: weak: %p, IsOld: %d: value: %p, key: %p, "
      "value cid: %zx, key cid: %zx, old space phase: %d.\n",
      tag.ToCString(), weak_property.ptr()->untag(),
      weak_property.ptr()->IsOldObject(), value->untag(), key->untag(),
      value->GetClassId(), key->GetClassId(),
      thread->heap()->old_space()->phase());
  return Object::null();
}
...
@@ -85,6 +90,8 @@ class Expando<T> {
       var ephemeron = new _WeakProperty();
       ephemeron.key = object;
       ephemeron.value = value;
+      if (_should_validate)
+        ephemeron.validate("from []=");
       _data[idx] = ephemeron;
       _used++;
       return;
...
         // Ensure that the entry.key is not cleared between checking for it and
         // inserting it into the new table.
+        if (_should_validate)
+          entry.validate("from _rehash");
         var val = entry.value;
         var key = entry.key;
         if (key != null) {

The following log was collected at crash time:




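The log shows that for the crashing object (highlighted in yellow), a _rehash happened during Concurrent Marking (gc phase 1), and the WeakProperty was allocated in Old Space (Figure 1). Another _rehash then happened during Parallel Sweeping, and this time the WeakProperty was allocated from New Space (Figure 2). Finally comes the crash: with the gc phase at Done, a _rehash runs, and by then value's cid is already 1, i.e. FreeListElement.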

Root cause analysis

At this point the picture is clear. Objects allocated during Concurrent Marking are marked immediately:

They are then pushed onto the deferred marking stack:

void StubCodeCompiler::GenerateAllocateObjectSlowStub(Assembler* assembler) {
...
  __ CallRuntime(kAllocateObjectRuntimeEntry, 2);

  // Load result off the stack into result register.
  __ ldr(kInstanceReg, Address(SP, 2 * target::kWordSize));

  // Write-barrier elimination is enabled for [cls] and we therefore need to
  // ensure that the object is in new-space or has remembered bit set.
=> EnsureIsNewOrRemembered(assembler, /*preserve_registers=*/false);
...
static void EnsureIsNewOrRemembered(Assembler* assembler,
                                    bool preserve_registers = true) {
...
=> __ CallRuntime(kEnsureRememberedAndMarkingDeferredRuntimeEntry, 2);
...
DEFINE_LEAF_RUNTIME_ENTRY(uword /*ObjectPtr*/,
                          EnsureRememberedAndMarkingDeferred,
                          2,
                          uword /*ObjectPtr*/ object_in,
                          Thread* thread) {
  ObjectPtr object = static_cast<ObjectPtr>(object_in);
...
  // For incremental write barrier elimination, we need to ensure that the
  // allocation ends up in the new space or else the object needs to added
  // to deferred marking stack so it will be [re]scanned.
  if (thread->is_marking()) {
=>  thread->DeferredMarkingStackAddObject(object);
  }

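When the marker gets to it, the WeakProperty is already marked, so it is never put into the weak set for fixed-point processing. As a result, value ends up as a dangling pointer: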

void ProcessDeferredMarking() {
  ObjectPtr raw_obj;
  while ((raw_obj = deferred_work_list_.Pop()) != nullptr) {
    ASSERT(raw_obj->IsHeapObject() && raw_obj->IsOldObject());
    // N.B. We are scanning the object even if it is already marked.
=>  bool did_mark = TryAcquireMarkBit(raw_obj);  // did_mark is always false here.
    ...
    size = ProcessWeakProperty(raw_weak, did_mark);

intptr_t ProcessWeakProperty(WeakPropertyPtr raw_weak, bool did_mark) {
  // The fate of the weak property is determined by its key.
  ObjectPtr raw_key = LoadPointerIgnoreRace(&raw_weak->untag()->key_);
  if (raw_key->IsHeapObject() && raw_key->IsOldObject() &&
      !raw_key->untag()->IsMarked()) {
    // Key was white. Enqueue the weak property.
    if (did_mark) {
=>    EnqueueWeakProperty(raw_weak);  // did_mark is false, so it never enters the weak set.
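To make the interaction easier to follow, here is a toy C++ model of it. It is a sketch under heavy simplification, not Dart VM code: `Obj`, `ToyMarker` and `ValueSurvivesMarking` are hypothetical names, there is only one weak property, and the key is assumed to become reachable through some root after the deferred stack has been drained. What it shows is that a weak property which is already marked when it is popped gets `did_mark == false`, is never enqueued, and therefore never gets its value marked.

```cpp
#include <cassert>
#include <vector>

// Toy heap object. All names are hypothetical, not Dart VM types.
struct Obj {
  bool marked = false;
  Obj* key = nullptr;    // set only when the object models a WeakProperty
  Obj* value = nullptr;  // set only when the object models a WeakProperty
};

struct ToyMarker {
  std::vector<Obj*> deferred;      // models the deferred marking stack
  std::vector<Obj*> delayed_weak;  // weak properties waiting for their key

  // Returns true only for the caller that flips the object white -> black.
  static bool TryAcquireMarkBit(Obj* o) {
    if (o->marked) return false;
    o->marked = true;
    return true;
  }

  void ProcessWeakProperty(Obj* weak, bool did_mark) {
    assert(weak->key != nullptr && weak->value != nullptr);
    if (!weak->key->marked) {
      // Key is still white: remember the weak property only if we are the
      // ones who just marked it.
      if (did_mark) delayed_weak.push_back(weak);
      return;
    }
    weak->value->marked = true;  // key is live, so the value is live too
  }

  void ProcessDeferredMarking() {
    while (!deferred.empty()) {
      Obj* weak = deferred.back();
      deferred.pop_back();
      bool did_mark = TryAcquireMarkBit(weak);
      ProcessWeakProperty(weak, did_mark);
    }
  }

  // Revisit delayed weak properties until no new value gets marked.
  void ProcessWeakToFixedPoint() {
    bool progress = true;
    while (progress) {
      progress = false;
      for (Obj* weak : delayed_weak) {
        if (weak->key->marked && !weak->value->marked) {
          weak->value->marked = true;
          progress = true;
        }
      }
    }
  }
};

// Runs one toy marking cycle and reports whether value ends up marked.
// An unmarked value is what the sweeper would turn into a free-list element.
bool ValueSurvivesMarking(bool weak_allocated_marked) {
  Obj key, value, weak;
  weak.key = &key;
  weak.value = &value;
  weak.marked = weak_allocated_marked;  // allocated black during marking?

  ToyMarker marker;
  marker.deferred.push_back(&weak);
  marker.ProcessDeferredMarking();  // key is still white at this point
  key.marked = true;                // assumption: key is reached later via a root
  marker.ProcessWeakToFixedPoint();
  return value.marked;
}
```

In this model `ValueSurvivesMarking(true)` corresponds to the crashing case (the WeakProperty was marked at allocation) and the value stays unmarked, while `ValueSurvivesMarking(false)` goes through the weak set and the value is kept alive.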

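In addition, because the new WeakProperty and the stores to key and value sit right next to each other, the optimizer flags both stores as not needing a write barrier. Normally this is correct: new WeakProperty returns a New Space object, and such stores do not need to be remembered. During concurrent marking, however, new WeakProperty returns an Old Space object, and in that case a write barrier is actually required. Had there been a write barrier, both key and value would have been marked on the spot by the write barrier stub and pushed onto the mark stack, and nothing would have gone wrong. This is also why 1.X did not have the problem.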
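The effect of the eliminated barrier can also be illustrated with a small C++ sketch. Again this is a toy under stated assumptions, not the VM's barrier: `ToyThread`, `StoreWithBarrier`, `StoreNoBarrier` and `ValueMarkedAfterStore` are hypothetical names, and the barrier condition is reduced to "marking is in progress, holder and target are old, target is not yet marked".

```cpp
#include <cassert>
#include <vector>

// Toy object and thread state. Names are hypothetical, not Dart VM types.
struct Obj {
  bool is_old = false;
  bool marked = false;
  Obj* value = nullptr;
};

struct ToyThread {
  bool is_marking = false;
  std::vector<Obj*> mark_stack;
};

// Store that runs a simplified marking write barrier: while marking is in
// progress, an unmarked old target is marked on the spot and pushed onto
// the mark stack.
void StoreWithBarrier(ToyThread* thread, Obj* holder, Obj* target) {
  assert(thread != nullptr && holder != nullptr && target != nullptr);
  holder->value = target;
  if (thread->is_marking && holder->is_old && target->is_old &&
      !target->marked) {
    target->marked = true;
    thread->mark_stack.push_back(target);
  }
}

// Store with the barrier eliminated: only the pointer is written.
void StoreNoBarrier(Obj* holder, Obj* target) {
  assert(holder != nullptr && target != nullptr);
  holder->value = target;
}

// Models `ephemeron.value = value` during concurrent marking, where the
// freshly allocated holder is in old space and already marked.
bool ValueMarkedAfterStore(bool use_write_barrier) {
  ToyThread thread;
  thread.is_marking = true;

  Obj value;
  value.is_old = true;

  Obj weak;
  weak.is_old = true;
  weak.marked = true;  // allocated during marking, so marked right away

  if (use_write_barrier) {
    StoreWithBarrier(&thread, &weak, &value);
  } else {
    StoreNoBarrier(&weak, &value);
  }
  return value.marked;
}
```

With the barrier the value is marked at store time regardless of what the marker later does with the weak property; without it, the value's fate depends entirely on the weak-property path described earlier.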

I reported the analysis to the Dart engineer 雨果洛夫, who filed the bug on GitHub and thanked me.

Summary

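This is probably among the 5 hardest bugs I have run into. Several edge conditions collided to produce it, which really tested my familiarity with the GC, code generation, object layout and other modules, as well as my imagination.
Even the Dart team itself spent a long time chasing this kind of problem. Other teams that hit it would likely be at a loss.
How do you stay calm when facing problems like this? By choosing the work of the hummer and U4 teams, with their deep technical foundations. A strong team safeguards the stability of your product and empowers hundreds of millions of users.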

